ABSTRACT
Data-intensive applications are becoming increasingly popular. However, only a few of them, with sufficiently high volume, can afford dedicated hardware acceleration (such as a neural network processor, or NPU) or a platform-specific software implementation (such as TensorFlow running on GPUs). In this paper, we propose a hardware- and software-transparent framework for accelerating general-purpose data-intensive applications. Our framework is based on a key insight: most data-intensive applications spend the vast majority of their execution time in a few inner loops with abundant opportunities for data-level parallelism (DLP). In particular, we propose SALAD, a static analyzer for loop acceleration that exploits DLP in hot loops, built on the LLVM compiler infrastructure. In contrast to traditional DLP exploitation techniques, SALAD is transparent to both software and architecture: it requires no changes to source or binary code and no vectorized instruction set architecture (ISA) extensions. Instead, it works directly on the program binary and generates a profile of the DLP opportunities it contains. This profile is fed transparently to the hardware accelerator to speed up execution. Based on our experimental results, we estimate that the DLP information provided by SALAD could yield 3.6x-60.2x speedups on a set of benchmarks, depending on their inherent DLP.